The goal of this notebook is to look at the haplotyped output from CliqueSNV with different parameters over different regions of the genome. I want to determine which parameters are optimal for a given set of tissues. I also want to see how well the tool performs generally. Ideally, I can use the output of this tool to stitch together a set of haplotypes for phylogenetic inference.
## ==== File paths input data ==== ##
# Data from lofreq and varscan labeled with subclonal haplotypes
updated.data = "../../results/variants/clustered_variants.csv"
# Haplotypes parsed from CliqueSNV
haplotype.data = "../../results/haplotypes/haplotypes.csv"
# Import the sample data
sample.data = "../../config/samples.csv"
# Annotations
annotations.filepath = "../../config/annotations.csv"
## ==== File paths output data ==== ##
# Path to save the figures for lab meeting
output.path = "../../results/spatial"
I first tried to cut the genome in 3000 bp chunks with 2000 bp overlaps. I did this with five different values for minimum O22 count (t in CliqueSNV) [20, 40, 80, 120, 360] and four different values minimum haplotype fraction (tf in CliqueSNV) [0.02, 0.05, 0.1, 0.25]. All of these combinations resulted in a significant fraction of haplotypes that were verifiably false, i.e. haplotypes that contained two mutually exclusive SNPs. There was no clear correlation between values of t and tf and the fraction of false haplotypes over the size of regions I choose. This led me to suspect that the specificity issue was the result of the size of the region being haplotyped, not the parameters. I re-did this for smaller regions of 500 bp with 250 bp overlaps and two values for t (10, and 100) and one value for tf (0.05).
# Upload the variants
updated.df = read_csv(updated.data, show_col_types = FALSE)
haplotypes.label = updated.df %>%
select(SNP, Haplotype, Background) %>%
distinct()
# Expand the mutations to have a frequency for every tissue.
expanded.df = updated.df %>%
select(SNP, Tissue, AF) %>%
pivot_wider(names_from = "Tissue", values_from = "AF", values_fill = 0) %>%
pivot_longer(cols = !SNP, names_to = "Tissue", values_to = "AF") %>%
left_join(., select(updated.df, c("SNP", "Tissue", "DP")), by = c("SNP", "Tissue")) %>%
mutate(DP = if_else(is.na(DP), 0, DP)) %>%
left_join(., haplotypes.label, by = "SNP")
# Get the mean frequency of the major genomes
tissue.mean = updated.df %>%
filter(Haplotype %in% c("genome-1", "genome-2")) %>%
group_by(Tissue, Haplotype) %>%
summarize(AF.mean = mean(AF, na.rm = TRUE),
SD = sd(AF, na.rm = TRUE),
N = n()) %>%
mutate(SE = SD / sqrt(N),
Lower.CI = qt(1 - (0.05 / 2), N - 1) * SE,
Upper.CI = qt(1 - (0.05 / 2), N - 1) * SE) %>%
rename("AF" = AF.mean)
# Join the haplotyping information to the variant calling data frame.
haplotype_df = read_csv(haplotype.data, show_col_types = FALSE) %>%
mutate(SNP = paste0(REF, POS, ALT),
Interval = paste(Start, Stop, sep = "-"),
Parameter_Combination = paste(`T`, TF, Start, Stop, sep = "-")) %>%
left_join(updated.df, ., by = c("POS", "REF", "ALT", "SNP", "Tissue"))
I consider a haplotype that is “false” as a haplotype with two mutually exclusive SNPs. I define mutually exclusive SNPs as those which are defined as primary G1 or G2 SNPs or SNPs that have been classified as G1 or G2 while clustering subclonal mutations.
## [1] "There are only 2.01 percent of total haplotypes with genome-1 and genome-2 mutations that are verifiably false."
## # A tibble: 6 × 2
## Tissue Interval
## <chr> <chr>
## 1 Cerebellum 3049-3549
## 2 Cerebellum Nucleus 4299-4799
## 3 Midbrain 4049-4549
## 4 Midbrain 4299-4799
## 5 Occipital Lobe 4049-4549
## 6 Parietal Lobe 4049-4549
The smaller regions work much better as an approach to accurately call haplotypes. There are only 3 haplotypes out of ~250 that are verifiably false. Interestingly, these are all pretty much in the same region of the genome (3049 - 4549). Is there something special about this region? Perhaps a significant decrease in coverage?
I doesn’t seem like the problem is a coverage issue. Aside from the fact that Cerebellum has less relative coverage to other samples, the is nothing special about this particular region.
Perhaps it has something to do with the SNPs involved and something is mis-annotated as genome-1 or genome-2? The mis-annotated SNP in the cerebellum probably has something to do with the low coverage and distance from other mutations. I’m not sure what’s going on with the other two. They involve the same mutation - C4502T.
We won’t include these intervals and haplotypes in subsequent analyses since we know that they’re false.
dont.include = false.haplotypes.to.explore %>%
mutate(cliqueSNV_haplotype = paste(Tissue, Interval, HAPID, sep = "-")) %>%
pull(cliqueSNV_haplotype)
So far, it seems like this version of CliqueSNV work alright, with the caveat that we don’t really know how to judge this other than by comparing it to other, worse performing parameters.
Now, let’s look at all of the mutations grouped with each of the subclonal clusters that we already identified. This could help us link some of these haplotypes together and identify haplotypes on the background of these already classified haplotypes.
subclonal.clusters = haplotype_df %>%
filter(grepl('cluster', Haplotype)) %>%
pull(Haplotype) %>%
unique() %>%
mixedsort()
t = 10
tf = 0.05
subclonal.plots.list = list()
for (i in 1:length(subclonal.clusters)) {
cluster = subclonal.clusters[i]
haplotypes.to.search = haplotype_df %>%
filter(`T` == t & TF == tf & !is.na(HAPID)) %>%
filter(Haplotype == cluster) %>%
mutate(cliqueSNV_haplotype = paste(Tissue, Interval, HAPID, sep = "-")) %>%
filter(!(cliqueSNV_haplotype %in% dont.include)) %>%
pull(cliqueSNV_haplotype) %>%
unique()
plot = haplotype_df %>%
filter(`T` == t & TF == tf & !is.na(HAPID)) %>%
mutate(cliqueSNV_haplotype = paste(Tissue, Interval, HAPID, sep = "-")) %>%
filter(cliqueSNV_haplotype %in% haplotypes.to.search) %>%
ggplot(aes(x = POS, y = AF, col = Haplotype, shape = Background)) +
geom_point(size = 2.5) +
facet_wrap(~(Tissue)) +
theme_bw(18)
subclonal.plots.list[[i]] = plot
}
There are no additional subclonal mutations identified with cluster 1 mutations. ### Cluster 2
It looks like cluster 3 might be on the background of cluster 2!
cluster 3 appears to be on the background of cluster 2, and there could be a subclonal mutation (C11173G) on the background of cluster 3 in the UBS.
cluster 4 appears to have a handful of subclonal mutations unique to the parietal lobe.
cluster 5 Looks like it has one subclonal mutations that could be on it’s background in the Frontal Cortex 2 sample.
Nothing super interesting for cluster 6.
Same for cluster 7, nothing super interesting here.
cluster 8 is interesting because it seems like it’s probably fixed on genome-2. There are some subclonal mutations that also appear fixed on genome-2 if Clique-SNV is to be believed.
cluster 9 is really interesting, unsurprisingly. This cluster is pretty much fixed on genome-1 in the Frontal Cortex 2 samples, but it’s at much lower frequency in the rest of the brain. This represents an early divergence in genome-1. It seems like cluster 13 and cluster 14 might be related to this haplotype, but I would take that with a grain of salt given that genome-1-1 also shows up, and that should be mutually exclusive.
It’s likely that the mutations that got grouped with cluster 9 but show up on reads with genome-1-1 could be mislabled as cluster 9. This needs more investigation.
Nothing interesting about cluster 10
Nothing interesting about cluster 11
cluster 12 is interesting because it looks like it’s on the background of quite an extensive set of subclonal mutations, and it might be grouped with cluster 13. These subclonal mutations are probably on the background of cluster 12.
cluster 13 is a similar deal..
cluster 14 is looking similarly crazy to cluster 9 and cluster 13. There is something to be resolved here.